FIGURE 5.14
Attention-head view for (a) the full-precision BERT, (b) the fully binarized BERT baseline, and (c) BiBERT for the same input. BiBERT with Bi-Attention behaves similarly to the full-precision model, while the baseline suffers from indistinguishable attention caused by information degradation.
ited capabilities, the ideal binarized representation should preserve the given full-precision counterparts as much as possible, which means the mutual information between the binarized and full-precision representations should be maximized. When the deterministic sign function is applied to binarize BERT, this goal is equivalent to maximizing the information entropy H(B) of the binarized representation B [171], which is defined as
$$H(\mathbf{B}) = -\sum_{B} p(B)\log p(B), \tag{5.26}$$
where $B \in \{-1, 1\}$ is the random variable sampled from $\mathbf{B}$ with probability mass function $p$. Therefore, the information entropy of the binarized representation should be maximized to
better preserve the full-precision counterparts and let the attention mechanism function
well.
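As a concrete illustration of the entropy objective in Eq. (5.26), the following minimal PyTorch sketch estimates H(B) empirically from a binarized tensor; the helper name and the random input are illustrative assumptions, not part of the original method.

```python
import torch

def binary_entropy(b: torch.Tensor) -> torch.Tensor:
    """Empirical information entropy H(B) of a tensor with values in {-1, +1}.

    H(B) reaches its maximum log(2) when +1 and -1 are equally likely,
    which is the condition the binarizer should approach per Eq. (5.26).
    """
    p_pos = (b > 0).float().mean()              # empirical p(B = +1)
    p = torch.stack([p_pos, 1.0 - p_pos])       # [p(+1), p(-1)]
    p = p.clamp_min(1e-12)                      # guard against log(0)
    return -(p * p.log()).sum()

# A zero-mean input binarized by sign() is roughly balanced, so the
# estimated entropy is close to the maximum log(2) ≈ 0.693.
x = torch.randn(4, 12, 64)                      # illustrative hidden representation
print(binary_entropy(torch.sign(x)).item())
```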
As for the attention structure in full-precision BERT, the normalized attention weight obtained by softmax is essential. But the direct application of the binarization function causes a complete information loss in the binarized attention weight. Specifically, since $\operatorname{softmax}(\mathbf{A})$ is regarded as following a probability distribution, the elements of $\mathbf{B}^{s}_{\mathbf{A}}$ are all quantized to 1 (Fig. 5.14(b)) and the information entropy $H(\mathbf{B}^{s}_{\mathbf{A}})$ degenerates to 0. A common measure to alleviate this information degradation is to shift the distribution of the input tensors before applying the sign function, which is formulated as
$$\hat{\mathbf{B}}^{s}_{\mathbf{A}} = \operatorname{sign}\left(\operatorname{softmax}(\mathbf{A}) - \tau\right), \tag{5.27}$$
where the shift parameter $\tau$, also regarded as the threshold of binarization, is expected to maximize the entropy of the binarized $\hat{\mathbf{B}}^{s}_{\mathbf{A}}$ and is fixed during inference. Moreover, the
attention weight obtained by the sign function is binarized to $\{-1, 1\}$, while the original attention weight has a normalized value range of $[0, 1]$. Negative attention weights in the binarized architecture are contrary to the intuition behind the existing attention mechanism and are also empirically shown to be harmful to the attention structure.
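To make the degeneration and the shift remedy of Eq. (5.27) concrete, here is a minimal PyTorch sketch; the tensor shapes and the threshold value 1/8 are illustrative assumptions rather than settings from the original work.

```python
import torch

def shifted_sign_attention(A: torch.Tensor, tau: float) -> torch.Tensor:
    """Binarize softmax attention with a shift threshold, as in Eq. (5.27).

    With tau = 0, every element of softmax(A) is positive, so sign() maps the
    whole tensor to +1 and the information entropy collapses to 0.
    """
    probs = torch.softmax(A, dim=-1)            # normalized attention weights in [0, 1]
    return torch.sign(probs - tau)              # values in {-1, +1}

A = torch.randn(2, 4, 8, 8)                     # (batch, heads, queries, keys), illustrative
print(shifted_sign_attention(A, tau=0.0).unique())      # tensor([1.]) -> degenerate
print(shifted_sign_attention(A, tau=1.0 / 8).unique())  # tensor([-1., 1.]) -> both values occur
```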
To mitigate the information degradation caused by binarization in the attention mechanism, the authors introduced an efficient Bi-Attention structure for fully binarized BERT, which statistically maximizes the information entropy of binarized representations and applies bitwise operations for fast inference. In detail, they proposed to binarize the attention weight into Boolean values, with the design driven by information entropy maximization. In Bi-Attention, the bool function is leveraged to binarize the attention score $\mathbf{A}$, which is defined as
$$\operatorname{bool}(x) =
\begin{cases}
1, & \text{if } x \ge 0 \\
0, & \text{otherwise},
\end{cases} \tag{5.28}$$
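A minimal PyTorch sketch of the bool binarizer in Eq. (5.28) follows; the function name is an illustrative assumption (chosen so as not to shadow Python's built-in `bool`).

```python
import torch

def bool_binarize(x: torch.Tensor) -> torch.Tensor:
    """bool(x) from Eq. (5.28): 1 where x >= 0, 0 otherwise.

    Unlike sign(), the output stays in {0, 1}, matching the non-negative
    value range of softmax attention weights.
    """
    return (x >= 0).to(x.dtype)

A = torch.randn(2, 4, 8, 8)                     # illustrative attention scores
print(bool_binarize(A).unique())                # tensor([0., 1.])
```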